Abstract: After a brief review of recent advances in sequential analysis involving
sequential generalized likelihood ratio tests, we discuss their use in psychometric
testing and extend the asymptotic optimality theory of these sequential tests to the
case of sequentially generated experiments, of particular interest in computerized
adaptive testing. We then show how these methods can be used to design adaptive
mastery tests, which are asymptotically optimal and are also shown to provide
substantial improvements over currently used sequential and fixed length tests.

1. Introduction

Sequential analysis of data is used in many types of psychometric tests. Some 
of these are computerized adaptive testing, classroom interaction assessment 
and intervention, psychological studies involving longitudinal data, depression 
diagnosis, and crime-suspect identification tests. The purpose of this article 
is to show how powerful techniques in modern sequential analysis can be used 
to design efficient testing procedures. In particular, we focus on computerized 
adaptive testing and show how these techniques can lead to substantial improvements
over previous sequential procedures as well as conventional tests that do
not incorporate early stopping. 

Computerized adaptive testing (CAT) has been extensively studied in the
psychometric literature as an efficient alternative to paper-and-pencil tests.
By selecting an examinee's $k$th test item based on his/her responses to items
$1, \ldots, k-1$, a CAT is tailored to the individual taking the examination and
is thus intended to quickly home in on each examinee's ability level. When
the test is designed to measure only one trait, the ability level is typically
denoted by $\theta$ to conform to the notation in standard item response theory.
There is substantial literature on efficient estimation of $\theta$ in CAT
applications (van der Linden and Pashley, 2000; Chang and Ying, 2003) and on the
problem of classifying examinees as either masters or non-masters in a given
content area (Reckase, 1983; Lewis and Sheehan, 1990; Chang, 2004; Chang, 2005).
The latter problem, known as computerized mastery testing (CMT), can be formalized
by setting a cut point $\theta_0$ and defining an examinee as a master if and only if
his/her ability level $\theta$ meets or exceeds that cut point.

Typically, a CMT assumes a so-called "indifference region" $(\theta_-, \theta_+)$
containing $\theta_0$, which may be thought of as the ability values which are close
enough to the cut point that neither a decision of mastery nor a decision of
non-mastery would result in a serious error. The statistical hypothesis of mastery
is then given by $H_0: \theta \ge \theta_+$, while the hypothesis of non-mastery is
given by $H_1: \theta \le \theta_-$. In a CMT, it is often the case that an examinee
can be quickly identified as a master or a non-master if that examinee's ability is
substantially higher or lower than the cut point. Therefore, CMT often involves
variable-length testing whereby the number of items administered varies by examinee.
An important goal in CMT is to strike a balance between the confidence of a
correct decision and the economy of the number of items administered. There
are thus two essential components of any CMT: (i) the stopping rule that determines
when to cease testing and make a classification decision; (ii) the method
used to select items adaptively based on an examinee's item response pattern.

The sequential probability ratio test (SPRT; Wald, 1947) has been studied
as a candidate stopping rule (Spray and Reckase, 1996; Eggen, 1999; Vos,
2000; Chang, 2004) for CMT. The SPRT has shorter average test lengths than
fixed-length tests with the same type I and II error rates at two specific points
along the $\theta$ scale. Although it has shorter average length, the SPRT does not
constrain the maximum number of items administered. For a test to have no
more than $N$ items, it is necessary to use a truncated SPRT (TSPRT), which
halts testing and makes a classification decision once $N$ items have been
administered. Suppose that $k$ items have been presented to an examinee, yielding the
responses $u_1, \ldots, u_k$, where

$$ u_i = \begin{cases} 1, & \text{if the examinee answers the $i$th item presented correctly,} \\ 0, & \text{if the examinee answers the $i$th item presented incorrectly.} \end{cases} \tag{1} $$

The classical theory of the SPRT assumes independence of responses so that
the likelihood of $\theta$ is

$$ L_k(\theta) = \prod_{i=1}^{k} [p_i(\theta)]^{u_i} [1 - p_i(\theta)]^{1-u_i}, \tag{2} $$

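where $p_i(\theta) = P_\theta\{u_i = 1\}$ for an examinee of ability $\theta$. The SPRT
stops after the $k$th item and rejects $H_0: \theta \ge \theta_+$ if

$$ \log\frac{L_k(\theta_-)}{L_k(\theta_+)} \ge A, \tag{3} $$

or accepts $H_0$ if

$$ \log\frac{L_k(\theta_-)}{L_k(\theta_+)} \le -B, \tag{4} $$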

where $A, B > 0$ are chosen so that $P_{\theta_+}\{\text{reject } H_0\} = \alpha$ and
$P_{\theta_-}\{\text{accept } H_0\} = \beta$. Wald's (1947) approximation yields

$$ A = \log((1-\alpha)/\beta), \qquad B = \log((1-\beta)/\alpha). \tag{5} $$

The TSPRT stops with (3) or (4) for $k < N$, and if stopping has not occurred
by the $(N-1)$st item, it rejects $H_0$ if and only if

$$ \log\frac{L_N(\theta_-)}{L_N(\theta_+)} \ge C. \tag{6} $$

For the TSPRT, Spray and Reckase (1996) and Eggen (1999) still use (5) for
the values of $A$ and $B$ and use for $C$ in (6) the value

$$ C = (A - B)/2. \tag{7} $$

The motivation for (6) and (7) is that all examinees classified as non-masters
at the $N$th item have a log-likelihood ratio no further from $A$ than from $-B$, and
those classified as masters have a log-likelihood ratio no further from $-B$ than
from $A$. Since (5) is based on the error rates of the untruncated SPRT, the true
error rates of the truncated procedure, whose decision at truncation is given by (6)
and (7), are often substantially inflated (see Table 1 below). This is of particular
concern in CMT, where $\alpha$ represents the percentage of proficient examinees who
are failed.
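
The truncated SPRT described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code: the function names `wald_thresholds` and `tsprt` are hypothetical, the log-likelihood ratio is taken to be that of the lower ability level against the upper one (so large values favor non-mastery), and the item success probabilities at the two ability levels are assumed to be supplied by the caller.

```python
import math

def wald_thresholds(alpha, beta):
    """Thresholds in the form the text gives for Wald's approximation."""
    return math.log((1 - alpha) / beta), math.log((1 - beta) / alpha)

def tsprt(responses, p_minus, p_plus, A, B, C):
    """Truncated SPRT on 0/1 responses (hypothetical helper).

    p_minus[i], p_plus[i]: probability of a correct answer to the i-th
    administered item at the lower and upper ability levels.
    Returns (decision, number_of_items_used).
    """
    n = len(responses)
    llr = 0.0  # running log-likelihood ratio, lower level over upper level
    for k, u in enumerate(responses, start=1):
        pm, pp = p_minus[k - 1], p_plus[k - 1]
        llr += math.log(pm / pp) if u == 1 else math.log((1 - pm) / (1 - pp))
        if k < n:
            if llr >= A:       # strong evidence for the lower level
                return "non-master", k
            if llr <= -B:      # strong evidence for the upper level
                return "master", k
    # Truncation: compare the final statistic with C
    return ("non-master" if llr >= C else "master"), n

A, B = wald_thresholds(0.05, 0.05)
C = (A - B) / 2
print(tsprt([1] * 20, [0.4] * 20, [0.7] * 20, A, B, C))
```

With equal nominal error rates the truncation threshold is zero, so an examinee who reaches the last item is classified by the sign of the final log-likelihood ratio, which is where the inflated error rates come from.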

We address this problem herein by using a new class of stopping rules, recently
introduced in the sequential analysis literature for testing the composite
hypotheses $H_0$ versus $H_1$ subject to type I and II error probability constraints
and a prescribed maximum number of observations. These tests use generalized
likelihood ratio (GLR) statistics instead of simple likelihood ratios and
have been shown to have certain optimality properties when the observations
are independent and identically distributed (i.i.d.) and their common distribution
belongs to an exponential family. In a CAT, the successive responses
$u_1, u_2, \ldots$ of an examinee, however, are not identically distributed and may not
even be independent if the items are chosen adaptively, since most CATs choose
the next item to be an unused item in the available item pool according to
some criterion. This is another reason, besides the truncation issue, why
the theory of the SPRT is not applicable to CMTs. We show in Section 2.2
that modern sequential testing theory can in fact accommodate this adaptive
feature in sequential experimentation in addition to providing efficient stopping
and terminal decision rules. In fact, the methodology developed in Section 2,
which is illustrated by applications to CMTs, is applicable to a large variety
of psychometric tests, allowing sequential choice of experiments (items in the
CMT context) and providing a powerful test at the conclusion of the study
that satisfies the prescribed type I error probability constraint and whose
expected sample size is nearly optimal and can be considerably smaller than the
prescribed maximum sample size.

This paper is organized as follows. Section 2 first gives a review of recent 
developments in sequential GLR tests of composite hypotheses based on i.i.d. 
observations from an exponential family. Then the i.i.d. assumption is removed 
and the theory is extended to the case where experiments are chosen adaptively 
to generate an observation (response) at the next stage. The methodology is 
then applied to the design of efficient CMTs, in which the sequential choice of 
experiments corresponds to sequential selection of items to be administered to 
an examinee based on item response theory. Section 3 reports simulation studies 
of the performance of the proposed CMT and compares it with commonly-used 
fixed-length tests and TSPRTs. Section 4 gives some concluding remarks. 

2. Modern Sequential Methods and Their Applications to CMT

2.1 Efficient Sequential GLR Tests for I.I.D. Observations

To summarize recent advances in sequential hypothesis testing in a general
framework that is applicable to psychometric testing including CMTs, let
$X_1, X_2, \ldots$ be i.i.d. observations from an exponential family of densities
$f_\theta(x) = e^{\theta x - \psi(\theta)}$ and let $L_k(\theta)$ denote the likelihood

$$ L_k(\theta) = \prod_{j=1}^{k} f_\theta(X_j). $$

The SPRT, which uses the simple likelihood ratio $\log(L_k(\theta_-)/L_k(\theta_+))$ to
test the hypotheses $H_0: \theta \ge \theta_+$ versus $H_1: \theta \le \theta_-$, is only
optimal in the rare case that $\theta$ is exactly $\theta_-$ or $\theta_+$ and has to allow
the possibility of many more than $N$ observations being taken. A powerful technique
in modern sequential analysis that allows the type I error probability to be
controlled while having a maximum sample size $N$ and preserving asymptotic
optimality over the entire parameter space (instead of just at $\theta_+$ or $\theta_-$)
is the modified Haybittle-Peto test (Lai and Shih, 2004). Let $\hat\theta_k$ denote the
maximum likelihood estimator (MLE) of $\theta$ based on $X_1, \ldots, X_k$. The modified
Haybittle-Peto test involves replacing the simple likelihood ratios in (3) and (4)
by the GLR statistic $L_k(\hat\theta_k)/L_k(\theta')$, which "self-tunes" to information
about the true $\theta$ accumulating in $\hat\theta_k$ over the course of the test and in
which $\theta'$ denotes the appropriate alternative that will be specified below. Lai
and Zhang (1994) and Lai (1997, 2001) have shown that sequential GLRs are efficient
in many testing problems when the thresholds (e.g., $A$, $B$ in (3), (4)) are
appropriately adjusted, even when $\theta$ is multidimensional. However, the
distribution of the GLR is generally more complicated than that of the
simple likelihood ratio, and the classical approximations (5) do not apply. But
with the modern computing power that is readily available to practitioners,
Monte Carlo simulation or recursive numerical methods are viable and often
the preferred methods for computing the thresholds, especially in light of the
inflated error probabilities that result from using classical approximations with
truncated tests; see Jennison and Turnbull (2000, Chapter 19) and Lai and Shih
(2004).

The modified Haybittle-Peto test of the hypotheses $H_0: \theta \ge \theta_+$ and
$H_1: \theta \le \theta_-$ can be described as follows. If $N$ is the maximum number of
observations and $\alpha$, $\beta$ are the desired type I and II error probabilities,
then there is a value $\theta_-^{(N)} < \theta_+$ such that the likelihood ratio test of
$\theta = \theta_+$ versus $\theta = \theta_-^{(N)}$ based on $N$ observations has type I
and II error probabilities $\alpha$ and $\beta$; in this sense $\theta_-^{(N)}$ is referred
to as the implied alternative. Note that $\theta_-^{(N)}$ is not necessarily equal
to $\theta_-$, but it is the appropriate alternative to consider given the parameters
$N$, $\alpha$, $\beta$, and $\theta_+$. In addition, focusing on the implied alternative
$\theta_-^{(N)}$ frees us from having to specify the alternative $\theta_-$, which is often
chosen arbitrarily in practice. A detailed example of how to compute $\theta_-^{(N)}$ is
given in Section 3.1. Let $0 < \rho < 1$. For $\rho N \le k < N$, the modified
Haybittle-Peto test stops after the $k$th item and rejects $H_0$ if

$$ \hat\theta_k < \theta_+ \quad\text{and}\quad \log\frac{L_k(\hat\theta_k)}{L_k(\theta_+)} \ge A, \tag{8} $$

or accepts $H_0$ if

$$ \hat\theta_k > \theta_-^{(N)} \quad\text{and}\quad \log\frac{L_k(\hat\theta_k)}{L_k(\theta_-^{(N)})} \ge B, \tag{9} $$

for some constants $A$ and $B$. For $k = N$, the test is always terminated, with
$H_0$ rejected if and only if

$$ \hat\theta_N < \theta_+ \quad\text{and}\quad \log\frac{L_N(\hat\theta_N)}{L_N(\theta_+)} \ge C \tag{10} $$

for some constant $C$. If both (8) and (9) hold for some $k$ (which can only
happen when $A$ and $B$ are artificially small), then either decision can be made,
for example, always accepting $H_0$ or deciding based on $\hat\theta_k$. In CMT, where the
false negative rate is critical, a simple approach is to classify as proficient, i.e.,
accept $H_0$, when this occurs; we take this as the definition here.
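
To make the stopping rule concrete, here is a minimal Python sketch for the simplest exponential family, i.i.d. Bernoulli observations. It is an illustration under stated assumptions rather than the authors' implementation: the hypotheses are expressed through the success probability (a monotone function of the natural parameter, so the ordering of estimates is unchanged), `p_plus` plays the role of the null boundary, `p_alt` plays the role of the implied alternative, and the names `xlogy`, `log_glr` and `mod_hp` are hypothetical.

```python
import math

def xlogy(x, y):
    """x * log(y) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def log_glr(successes, k, p0):
    """log of L_k(p_hat) / L_k(p0) for k Bernoulli trials, p_hat = successes / k."""
    p_hat = successes / k
    failures = k - successes
    log_lik_hat = xlogy(successes, p_hat) + xlogy(failures, 1 - p_hat)
    log_lik_0 = successes * math.log(p0) + failures * math.log(1 - p0)
    return log_lik_hat - log_lik_0

def mod_hp(xs, p_plus, p_alt, A, B, C, rho):
    """Sketch of a modified Haybittle-Peto test of H0: p >= p_plus.

    xs holds the (at most N = len(xs)) 0/1 observations in order.
    Returns (decision, sample_size).
    """
    N = len(xs)
    s = 0
    for k, x in enumerate(xs, start=1):
        s += x
        p_hat = s / k
        if k == N:  # always stop at the maximum sample size
            reject = p_hat < p_plus and log_glr(s, k, p_plus) >= C
            return ("reject" if reject else "accept"), k
        if k >= rho * N:
            # Acceptance is checked first, so ties are resolved in favor of H0
            if p_hat > p_alt and log_glr(s, k, p_alt) >= B:
                return "accept", k
            if p_hat < p_plus and log_glr(s, k, p_plus) >= A:
                return "reject", k

print(mod_hp([1] * 20, p_plus=0.7, p_alt=0.4, A=3.0, B=3.0, C=1.4, rho=0.1))
```

The threshold values in the usage line are arbitrary placeholders; in practice they would be calibrated to the desired error probabilities.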

Next the thresholds $A$, $B$, and $C$ are chosen so that the false negative error
rate does not exceed $\alpha$ and the false positive error rate, at the alternative
$\theta_-^{(N)}$ implied by the maximum number $N$ of observations, is close to $\beta$.
Specifically, $A$, $B$, and $C$ will be chosen so that

$$ P_{\theta_-^{(N)}}\{\text{(9) occurs for some } k < N\} = \varepsilon\beta, \tag{11} $$
$$ P_{\theta_+}\{\text{(8) occurs for some } k < N, \text{ (9) does not occur for any } j \le k\} = \varepsilon\alpha, \tag{12} $$
$$ P_{\theta_+}\{\text{(8), (9) do not occur for any } k < N, \text{ (10) occurs}\} = (1-\varepsilon)\alpha \tag{13} $$

for some $0 < \varepsilon < 1$. In practice any value of $\varepsilon$ giving a test with
desirable properties can be used, and Lai and Shih (2004) have shown that values
$1/3 \le \varepsilon \le 1/2$ work well in a variety of settings. The values of $A$, $B$,
and $C$ that satisfy (11)-(13) can be determined by Monte Carlo simulation, a detailed
example of which is given in Section 3.1, or by numerical methods based on the
following normal approximation to the log-likelihood ratios: When $\theta$ is the true
parameter,


$$ Z_k = \operatorname{sign}(\hat\theta_k - \theta)\sqrt{2k\log\frac{L_k(\hat\theta_k)}{L_k(\theta)}} \approx N(0, k) \tag{14} $$

for large $k$, with independent increments $Z_k - Z_{k-1}$ (with $Z_0 = 0$). The normal
approximation (14) suggests replacing the signed-root statistic $Z_k$ by a sum of
independent standard normal random variables $S_k = Y_1 + \cdots + Y_k \sim N(0, k)$ so
that, for example, the condition (8) becomes $S_k/\sqrt{k} \le -\sqrt{2A}$. Then, in place
of (11)-(13), $B$, $A$, and $C$ can be successively found by solving

$$ P\{S_k/\sqrt{k} \ge \sqrt{2B} \text{ for some } k < N\} = \varepsilon\beta, \tag{15} $$
$$ P\{S_k/\sqrt{k} \le -\sqrt{2A} \text{ for some } k < N\} = \varepsilon\alpha, \tag{16} $$
$$ P\{S_k/\sqrt{k} > -\sqrt{2A} \text{ for all } k < N, \ S_N/\sqrt{N} \le -\sqrt{2C}\} = (1-\varepsilon)\alpha. \tag{17} $$

The left-hand sides of (15)-(17) can be computed by recursive one-dimensional
numerical integration; see Jennison and Turnbull (2000, Chapter 19) for a more
detailed discussion.
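
As an alternative to recursive numerical integration, a boundary-crossing probability for a standard normal random walk can be estimated by plain Monte Carlo. The sketch below is illustrative only and swaps in a simple shortcut that the text does not describe: since the walk crosses an upper boundary exactly when its running maximum of S_k/sqrt(k) does, the boundary is read off as an empirical quantile of that maximum. The function names and the restriction of k to the range from `m0` to N - 1 are assumptions made for this example.

```python
import math
import random

def simulate_max_stats(n_paths, N, m0, seed=0):
    """Simulate max over m0 <= k < N of S_k / sqrt(k) for a N(0,1) random walk."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_paths):
        s, best = 0.0, -math.inf
        for k in range(1, N):          # k = 1, ..., N - 1
            s += rng.gauss(0.0, 1.0)
            if k >= m0:
                best = max(best, s / math.sqrt(k))
        maxima.append(best)
    return maxima

def boundary_from_quantile(maxima, prob):
    """Return b such that the empirical P{max >= b} is approximately prob."""
    ordered = sorted(maxima)
    idx = math.ceil((1 - prob) * len(ordered)) - 1
    idx = max(0, min(idx, len(ordered) - 1))
    return ordered[idx]

maxima = simulate_max_stats(n_paths=4000, N=50, m0=5, seed=1)
b = boundary_from_quantile(maxima, prob=0.025)
B = b * b / 2   # boundary sqrt(2B) on the S_k / sqrt(k) scale gives B = b^2 / 2
print(round(b, 2), round(B, 2))
```

By symmetry of the normal walk the same simulated maxima can be reused for the lower boundary, so only the final-stage threshold needs a separate pass over the paths that have not crossed earlier.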

Closed-form approximations to the probabilities in (11)-(13) have been developed
by Siegmund (1985, Chapter 4), so that they can be computed approximately without
using Monte Carlo or numerical integration. Letting $\phi$ and $\Phi$ be the standard
normal density and c.d.f. and $m_0$ the smallest integer $\ge \rho N$, the normal
approximation (14) used in conjunction with Siegmund's (1985) boundary crossing
probability approximation yields

$$ \frac{1}{2}\Bigl\{\bigl(\sqrt{2A} - 1/\sqrt{2A}\bigr)\,\phi(\sqrt{2A})\,\log(N/m_0) + 4\phi(\sqrt{2A})/\sqrt{2A}\Bigr\} $$



as an approximation to (12), and 



2C) 



<j>W2A) 



2A 



log ^N/m - 2 + Alog(C/A) 



(18) 



(19) 



as an approximation to (13). The values of $A$ and $C$ can therefore be determined
by first setting (18) equal to $\varepsilon\alpha$ and solving numerically, and then
setting (19) equal to $(1-\varepsilon)\alpha$ and solving numerically. Replacing $A$ by
$B$ in (18) yields an analogous approximation for the probability in (11), which can
be solved numerically to find $B$.



The modified Haybittle-Peto test with thresholds $A$, $B$, $C$ satisfying (11)-(13)
has type I error rate $\alpha$ and never takes more than $N$ observations. It
has asymptotically the smallest possible sample size of all tests with the same
or smaller type I and II error probabilities. This was proved by Lai and Shih
(2004, Theorem 2(i)) in the context of group sequential tests, and their proof can
also be used to establish the following "fully sequential" version, of particular
interest in CAT.

Theorem 1. Let $0 < \rho < 1$, and let $X_1, X_2, \ldots$ be i.i.d. observations from
an exponential family with parameter $\theta$. Let $\mathcal{T}_{\alpha,\beta,N}$ be the
class of all tests of $H_0: \theta \ge \theta_+$ taking no more than $N$ but no fewer than
$\rho N$ observations and with error probabilities not exceeding $\alpha$ and $\beta$ at
$\theta = \theta_+$ and $\theta_-^{(N)}$, the alternative for which the likelihood ratio test
of $\theta = \theta_+$ versus $\theta = \theta_-^{(N)}$ based on $N$ observations has type I
and II error probabilities $\alpha$ and $\beta$. If $M$ is the sample size of the modified
Haybittle-Peto test, then as $\alpha \to 0$ and $\beta \to 0$ such that
$\log\alpha \sim \log\beta$,

$$ E_\theta M \sim \inf_{T \in \mathcal{T}_{\alpha,\beta,N}} E_\theta T \tag{20} $$

for all $\theta$.



2.2 Extension to Sequentially Generated Experiments

The primary motivation behind CAT is to reduce the length of the test by
adaptively creating a test better suited to the individual examinee (see Bickel,
Buyske, Chang, and Ying, 2001). This is accomplished by choosing an examinee's
$(k+1)$st test item based on his/her previous responses $u_1, \ldots, u_k$. Hence
the responses are no longer i.i.d., violating a basic assumption in Theorem 1
and also in the optimality theory of the SPRT. Another extension of Theorem 1
that needs to be made for applications to CAT is that the exponential family
of density functions in Theorem 1 has to be generalized to the form

$$ f_{\theta,j}(x) = e^{x\tau_j(\theta) - \psi(\tau_j(\theta))}, \qquad j \in J, \tag{21} $$

where $J$ is a set of experiments initially available. In particular, for CAT, whose
likelihood function is given in (2),

$$ \tau_j(\theta) = \log\frac{p_j(\theta)}{1 - p_j(\theta)}, \tag{22} $$
$$ \psi(\tau_j(\theta)) = \log(e^{\tau_j(\theta)} + 1) = -\log[1 - p_j(\theta)]. \tag{23} $$



Each item $j$ in (21) is a reparameterized exponential family, and when the $\tau_j$
are smooth functions of $\theta$, as in (22) (provided the $p_j$ are smooth), then the
form of the exponential family (21) implies that $\psi$ is smooth also. Then the
standard formulas for exponential families give

$$ E_{\theta,j} X_i = \psi'(\tau_j(\theta)), \qquad \operatorname{Var}_{\theta,j} X_i = \psi''(\tau_j(\theta)), \tag{24} $$

and therefore

$$ I_j(\theta, \lambda) = E_{\theta,j}\log[f_{\theta,j}(X_i)/f_{\lambda,j}(X_i)] = \psi'(\tau_j(\theta))[\tau_j(\theta) - \tau_j(\lambda)] - [\psi(\tau_j(\theta)) - \psi(\tau_j(\lambda))]. \tag{25} $$

Let $j_i$ denote the $i$th sequentially chosen experiment, and to avoid trivialities
from selection rules that somehow look "into the future," we assume experiments
are chosen according to some rule that involves only the previous observations
$X_1, \ldots, X_{i-1}$. The likelihood function still has the form

$$ L_k(\theta) = \prod_{i=1}^{k} f_{\theta,j_i}(X_i) $$

since the density function of $X_i$ given $X_1, \ldots, X_{i-1}$ is $f_{\theta,j_i}$. Hence the
computation of the error probabilities, and therefore also of the thresholds $A$, $B$,
$C$, for the modified Haybittle-Peto test in the present case can proceed in the same
way as in Section 2.1, in which $|J| = 1$ and $f_{\theta,j_i}(X_i)$ is simply $f_\theta(X_i)$.
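
As a check on the exponential-family formulas of this section in the CAT case, the following short derivation works them out for a single binary response. It is an illustrative sketch; the shorthand $p$ for the success probability at the true ability level, $q$ for the success probability at the comparison level, and $\tau_\lambda$ for the log-odds of $q$ are introduced only for this example.

```latex
% A binary response u with success probability p is of exponential-family form:
p^{u}(1-p)^{1-u} = \exp\{u\,\tau - \psi(\tau)\},
\qquad \tau = \log\frac{p}{1-p},
\quad \psi(\tau) = \log(1+e^{\tau}) = -\log(1-p).

% Mean and variance follow from the derivatives of \psi:
\psi'(\tau) = \frac{e^{\tau}}{1+e^{\tau}} = p,
\qquad
\psi''(\tau) = \frac{e^{\tau}}{(1+e^{\tau})^{2}} = p(1-p).

% Information number, with \tau_\lambda = \log\{q/(1-q)\}:
\psi'(\tau)\,(\tau-\tau_\lambda) - [\psi(\tau)-\psi(\tau_\lambda)]
  = p\log\frac{p}{1-p} - p\log\frac{q}{1-q} + \log(1-p) - \log(1-q)
  = p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}.
```

The last expression is the Kullback-Leibler information of one binary response, so the information number of a single item reduces to the familiar two-term form used for item selection.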

Before extending to the CAT setting where each item can only be used once, 
we extend Theorem 1 to the following setting in which an item can be used 
multiple times so that its information content can be learned by repeatedly 
using it, as in the case of nonlinear design of experiments (Fedorov, 1997) and 
as is used in sequential medical and psychological diagnosis. 

Theorem 2. Suppose that experiments are sequentially chosen from a set $J$
by a rule $\delta$ such that at stage $i$, the choice of $j_i$ depends only on
$X_1, \ldots, X_{i-1}$, that

$$ \nu_j = \lim_{n\to\infty} n^{-1}\sum_{i=1}^{n} P\{j_i = j\} \text{ exists for every } j, \tag{26} $$
$$ \inf_{|\theta| \le a} \sum_{j \in J} \nu_j\,\psi''(\tau_j(\theta))[\tau_j'(\theta)]^2 > 0 \text{ for all } a > 0, \tag{27} $$

and that the observations follow (21). Then (20) still holds for all $\theta$, where $M$
is the sample size of the modified Haybittle-Peto test and
$\mathcal{T}_{\alpha,\beta,N} = \mathcal{T}_{\alpha,\beta,N}(\delta)$ is the class of tests
described in Theorem 1 that use $\delta$ to select experiments at every stage prior to
stopping.

The proof of Theorem 2 is given in the Appendix, which also gives the
asymptotic theory of the MLE and GLR statistics in sequentially generated
experiments from (21) under the assumptions (26) and (27). This theory allows
us to use the approximation (14) to compute the probabilities in (11)-(13) and
thereby determine the thresholds $A$, $B$, $C$ of the modified Haybittle-Peto test
for the general exponential family considered here. The assumption (26) is a
consistency requirement that the long-run frequency $\nu_j$ with which experiment
$j$ is used must exist. For example, if experiments are completely randomized
then $\nu_j = 1/|J|$ for all $j \in J$. As will be seen in the proof of Theorem 2 in
the Appendix, the assumption (27) is a uniform convexity requirement of the
information numbers (25) which can usually be routinely verified. For example,
for any finite $J$ satisfying (22)-(23), (27) will hold provided $p_j'(\theta) > 0$ for
all $j$ and $\theta$ (i.e., the $p_j(\theta)$ are well defined with respect to $\theta$) since,
in this case,



(28) 



for all $j$, $|\theta| \le a$. Hence, the left-hand side of (27) is at least $\gamma > 0$.

We next modify Theorem 2 by imposing the additional constraint that each
experiment can be used at most once, as in CAT. This restriction implies that
$J = J(N)$ with $|J| \ge N$ and that we cannot learn about an experiment's
efficiency directly by using it repeatedly. On the other hand, an experiment's
efficiency can be learned indirectly through the estimate $\hat\theta$ of $\theta$. In
particular, suppose that $J$ can be partitioned into a fixed number $K$ of classes
$J_1, \ldots, J_K$ with $J_k = J_k(N)$ and such that the experiments in $J_k$ give rise to
observations that follow the same distribution in (21). That is, assume that

$$ J = \bigcup_{k=1}^{K} J_k \text{ and for each } k \text{ there is } \tau^{(k)} \text{ such that } \tau_j = \tau^{(k)} \text{ for all } j \in J_k. \tag{29} $$

In practice in CAT, these classes $J_k$ may represent items with the same or similar
item response properties. The asymptotic optimality of the modified Haybittle-Peto
test in this setting can be proved by the same arguments as those used
in the proof of Theorem 2, provided the classes $J_k$ satisfy some assumptions
analogous to (26)-(27). This is the content of the following theorem, whose
proof is given in the Appendix.



Theorem 3. Suppose that $J$ satisfies (29) and $|J| \ge N$, that experiments are
sequentially chosen by a rule $\delta$ such that $j_i$ depends only on
$X_1, \ldots, X_{i-1}$, that

$$ \nu^{(k)} = \lim_{n\to\infty} n^{-1}\sum_{i=1}^{n} P\{j_i \in J_k\} \text{ exists for every } k = 1, \ldots, K, \tag{30} $$
$$ \inf_{|\theta| \le a} \sum_{k=1}^{K} \nu^{(k)}\,\psi''(\tau^{(k)}(\theta))[\tau^{(k)\prime}(\theta)]^2 > 0 \text{ for all } a > 0, \tag{31} $$

that the observations follow (21), and that experiments cannot be used more
than once. Then the results of Theorem 2 hold for the modified Haybittle-Peto
test, where $\mathcal{T}_{\alpha,\beta,N}(\delta)$ is as described there.



2.3 Application: Efficient Design of CMT

To apply Theorem 3 to the design of efficient CMTs, we use item response
theory (IRT) to model the probability $p_j(\theta)$ that an examinee of ability $\theta$
gives the correct answer to item $j$. IRT is traditionally utilized in CMT to provide
methods for adaptive item selection as well as to estimate and compare the
respective abilities of examinees who were administered distinct sets of items.
We assume in the sequel the three-parameter logistic (3-PL) model (Lord, 1980):

$$ p_j(\theta) = c_j + \frac{1 - c_j}{1 + e^{-a_j(\theta - b_j)}} \tag{32} $$

with known parameters $(a_j, b_j, c_j)$ for all items $j$ in the available item pool.

Any CMT must have an item selection rule as well as a stopping rule. This
item selection rule is adaptive in the sense that the choice of the $k$th question
for an examinee depends on $u_1, \ldots, u_{k-1}$, where the $u_i$ are defined in (1) so
that $P_\theta\{u_i = 1 \mid u_1, \ldots, u_{i-1}\} = p_{j_i}(\theta)$, in which $j_i$ denotes the
item chosen for the $i$th question. Most item selection rules in the literature
maximize some index of psychometric information at a specified value of $\theta$ to
select the next item for a given examinee. One such index is the Kullback-Leibler
(KL) information, which for the 3-PL model (32) is

$$ I_j(\theta, \theta') = p_j(\theta)\log\frac{p_j(\theta)}{p_j(\theta')} + [1 - p_j(\theta)]\log\frac{1 - p_j(\theta)}{1 - p_j(\theta')}. \tag{33} $$

The KL information $I_j(\theta, \theta')$ is a measure of the distinguishability of the
true ability level $\theta$ from level $\theta'$ provided by item $j$. Another such measure
used in CMT is the Fisher information, which for the 3-PL model is

$$ I_j(\theta) = \frac{a_j^2 (1 - c_j)}{\bigl(c_j + e^{a_j(\theta - b_j)}\bigr)\bigl(1 + e^{-a_j(\theta - b_j)}\bigr)^2}. $$

Reckase (1983), Lewis and Sheehan (1990), Spray and Reckase (1996), and
Chang and Ying (2003) use procedures that choose the next item in a test to
be the unused item that maximizes the Fisher information at the cut point
$\theta_0$ or at a current estimate of $\theta$, like the MLE $\hat\theta_k$. Spray and Reckase
(1996) suggest maximizing information at $\theta_0$ rather than $\hat\theta_k$ when using the
SPRT. Eggen's (1999) simulations showed that KL information outperforms both of
these approaches based on Fisher information in some settings. These adaptive
item selection rules satisfy (30)-(31), as discussed following Theorem 2.
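
The 3-PL response probability and the two information indices above translate directly into code. The following Python sketch is illustrative: the function names and the representation of the item pool as a list of `(a, b, c)` tuples are assumptions for this example, and the Fisher information uses the standard closed form for the 3-PL model without any scaling constant.

```python
import math

def p3pl(theta, a, b, c):
    """Probability of a correct response under the 3-PL model."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

def fisher_info(theta, a, b, c):
    """Fisher information of a 3-PL item at ability theta."""
    e = math.exp(a * (theta - b))
    return a * a * (1 - c) / ((c + e) * (1 + 1 / e) ** 2)

def kl_info(theta, theta_alt, a, b, c):
    """Kullback-Leibler information for distinguishing theta from theta_alt."""
    p, q = p3pl(theta, a, b, c), p3pl(theta_alt, a, b, c)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def select_item(theta, pool, used):
    """Index of the unused item with maximal Fisher information at theta."""
    candidates = [j for j in range(len(pool)) if j not in used]
    return max(candidates, key=lambda j: fisher_info(theta, *pool[j]))

pool = [(1.0, 0.0, 0.2), (2.0, 0.0, 0.2), (1.0, 3.0, 0.2)]  # made-up (a, b, c)
first = select_item(0.0, pool, used=set())
second = select_item(0.0, pool, used={first})
print(first, second)
```

Passing the cut point as `theta` gives selection at a fixed ability level, while passing the current ability estimate gives selection that adapts as responses accumulate.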



3. Simulation Studies

3.1 Simulation of Proposed CMT

In this section we compare the fixed-length, TSPRT, and modified Haybittle-Peto
tests of $H_0: \theta \ge \theta_+$ versus $H_1: \theta \le \theta_-$ about the ability level
$\theta$ in the 3-PL model. To isolate the effects of the different stopping rules, all
tests use the same criterion, maximum Fisher information, to sequentially choose
items. To simulate the tests in a realistic setting, we used a real item pool from the
Chauncey Group International, a subsidiary of the Educational Testing Service.
The pool has 1136 3-PL item parameters, with $a_j$ ranging between 0.289 and
2.372 and having a median of 0.862, $b_j$ ranging between -5.531 and 5.426 and
having a median of -0.943, and $c_j$ ranging between 0.048 and 0.529 and having
a median of 0.232. The real-life cut point associated with the item pool is
$\theta_0 = -1.32$. Mimicking simulations by Lin and Spray (2000), $\theta_+$ and $\theta_-$
are taken to be $\theta_+ = \theta_0 + .25 = -1.07$ and $\theta_- = \theta_0 - .25 = -1.57$.
Following Spray and Reckase (1996), $N$ was set to 50. Table 1 gives the type I error
probability and the average length of the TSPRT for various values of
$\alpha = \beta$.

INSERT TABLE 1 ABOUT HERE

As mentioned above, the TSPRT with thresholds (5) and (7) usually has type
I and II error probabilities substantially larger than the nominal values $\alpha$ and
$\beta$. Table 1 shows that the actual type I error probability is roughly constant
at about .16 for $\alpha \le .1$. This is because the thresholds
$A = B = \log((1-\alpha)/\alpha)$ are large enough that truncation occurs for nearly
every examinee, evident through the large average test lengths, and consequently a
large proportion of the examinees are misclassified at the truncation point. Since
the type I error probability in CMT is the percentage of proficient $\theta_+$-level
examinees who are misclassified as non-proficient, we propose a modification of the
TSPRT by choosing $C$ suitably to make the type I error probability approximately
equal to the nominal value $\alpha$, rather than use (7). Table 2 contains the average
test length and percentage of examinees classified as non-master, i.e., the power, of
the following tests at various values of $\theta$: the TSPRT using thresholds (5) and (7)
with $\alpha = \beta = .05$; the TSPRT, modified in the way described above (denoted by
modTSPRT), with the same values of $A = B = \log(.95/.05)$ but with $C = 1.4$ to
give type I error $\alpha = .05$; the fixed-length test with $N = 50$, using classification
rule (6) with $C = 1.28$ that is chosen to give type I error probability $\alpha = .05$; the
modified Haybittle-Peto test (denoted by modHP) with $A = 3.7$, $B = 3.3$, $C = 1.4$
that are chosen to satisfy (11)-(13) with $\alpha = \beta = .05$, $\varepsilon = 1/2$,
$\rho N = 5$, and $\theta_-^{(N)} = -1.95$ where the fixed-length test has power
$1 - \beta = .95$; details of how $A$, $B$, $C$ and $\theta_-^{(N)}$ were computed are given
below. All four tests choose the next item to be the unused one that maximizes Fisher
information at the MLE, when it exists. When the MLE does not exist, the tests use
Fisher information at the real-life cut point $\theta = \theta_0$. The average test length
and power are computed at eleven values of $\theta$ between $-2$ and $-.5$, including
$\theta_0$, $\theta_+$ (in bold), $\theta_-$, and $\theta_-^{(N)}$, from 10,000 simulated tests each.

INSERT TABLE 2 ABOUT HERE

The fixed, modTSPRT, and modHP tests have very similar power functions
for $\theta \le \theta_+$. The TSPRT has high power but also a greatly inflated type I
error probability of 16.1%, resulting from the use of the approximations (5) in
its definition, as discussed above. The modTSPRT has the same average test
length as the TSPRT because they use the same thresholds $A$ and $B$, and both
provide savings in test length over the fixed-length test, particularly at
ability levels outside the indifference region $(\theta_-, \theta_+)$. The modHP test
provides substantial savings in test length over the fixed-length test as well as the
TSPRT and modTSPRT. The self-tuning nature of the GLR allows modHP to dramatically
shorten the tests of proficient examinees ($\theta \ge \theta_+$), for whom modHP is
about half the length of modTSPRT. Moreover, the modHP tests are shorter
on average even when $\theta = \theta_+$ or $\theta_-$, suggesting that the method of computing
thresholds (11)-(13) contributes to its efficiency as well as the use of the GLR
statistic.

The parameters $A$, $B$, $C$ and $\theta_-^{(N)}$ of modHP were computed by Monte Carlo
simulation as follows. Two simple numerical routines were used: Routine 1,
which takes as input a candidate value $\theta_-^{(N)} < \theta_+$ and returns the estimated
type II error probability of the fixed-length level-$\alpha$ likelihood ratio test of
$\theta = \theta_+$ vs. $\theta = \theta_-^{(N)}$ of length $N$, based on 10,000 simulated tests using
the Chauncey item pool; and Routine 2, which takes as inputs boundaries $A$, $B$ and
$C$ in (8)-(10), $\theta_-^{(N)}$, and the true ability level $\theta$, and returns the average
test length and estimated type I error probability of modHP using Fisher information
with maximum length $N = 50$, based on 10,000 simulated tests using the Chauncey
item pool. Routine 1 works by first solving for the critical value $C$ in (6) (with
$\theta_-$ replaced by $\theta_-^{(N)}$) giving type I error probability $\alpha = .05$. This is
done by finding two values of $C$ whose corresponding type I error probabilities
bracket .05 and then using the bisection method. After $C$ has been found, the type II
error probability is output by simulating tests at $\theta = \theta_-^{(N)}$, and bracketing
and bisection are again used to find the value $\theta_-^{(N)}$ giving type II error
probability $\beta = .05$. The value $\theta_-^{(N)} = -1.95$ was found in this way. Next,
Routine 2 was used to find $A$, $B$ and $C$ in (8)-(10) by first setting
$A = C = \infty$ and $\theta = \theta_-^{(N)}$ and finding $B$ that satisfies (11) with
$\varepsilon\beta = .05/2 = .025$ by bracketing and bisection. $B = 3.3$ was found in this
way and used to next find $A$ satisfying (12) with $\varepsilon\alpha = .05/2 = .025$ by
simulating at $\theta = \theta_+$. $A = 3.7$ was found and used to similarly find $C = 1.4$
satisfying (13) with $(1-\varepsilon)\alpha = .05/2 = .025$. Both Routines 1 and 2 are
stable and run quickly, and $A$, $B$, $C$ and $\theta_-^{(N)}$ are computed in a matter of
minutes.
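
The bracketing-and-bisection step used by both routines can be sketched generically. The Python below is an illustration, not the authors' routine: `bracket_and_bisect` is a hypothetical helper that assumes the error rate is nonincreasing in the threshold, and `normal_tail` is a toy stand-in for a simulated error rate so that the example runs on its own.

```python
import math

def bracket_and_bisect(error_rate, target, lo, hi, tol=1e-3, max_expand=60):
    """Find a threshold c with error_rate(c) close to target.

    Assumes error_rate is nonincreasing in c. For a Monte Carlo error_rate,
    reusing the same random seed on every call keeps it monotone in c.
    """
    width = hi - lo
    for _ in range(max_expand):        # widen until error_rate(lo) >= target
        if error_rate(lo) >= target:
            break
        lo -= width
    for _ in range(max_expand):        # widen until error_rate(hi) <= target
        if error_rate(hi) <= target:
            break
        hi += width
    while hi - lo > tol:               # bisection on the bracket [lo, hi]
        mid = 0.5 * (lo + hi)
        if error_rate(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def normal_tail(c):
    """Toy error rate: upper tail probability of a standard normal at c."""
    return 0.5 * math.erfc(c / math.sqrt(2))

print(round(bracket_and_bisect(normal_tail, 0.05, 0.0, 1.0), 2))
```

In a calibration like the one described above, the toy function would be replaced by a simulation that returns the estimated error probability for a given threshold, and the solver would be called once per threshold in turn.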



3.2 Simulation of Proposed CMT with Exposure Control and Content Balancing

Even though the example in Section 3.1 utilizes a real item pool, the tests 
are compared under somewhat ideal circumstances where items can be selected 
purely due to their statistical properties. However, since the modified Haybittle- 
Peto test presented above relies on no specific item selection rule or IRT model, 
it has the flexibility to incorporate additional constraints on item selection that 
arise in typical CATs, such as exposure control and content balancing in the 
choice of items. In this section we illustrate this by presenting a second simulation
study comparing the modified Haybittle-Peto test with the TSPRT and
fixed length test, all using the following simple method for exposure control and 
content balancing. 

Suppose that the exposure of the items in the pool needs to be controlled
so that each item is administered to no more than a proportion $\pi$ of
examinees, on average. Suppose also that the content of the test needs to be
balanced in the sense that each item in the pool falls into one of $s$
categories, and these categories should be represented approximately in given
proportions $q_1, \ldots, q_s$, where $\sum_{i=1}^{s} q_i = 1$. A simple way of
satisfying these constraints when using a test of maximum length $N$ is the
following. From each category $i = 1, \ldots, s$, first select the $Nq_i/\pi$
(neglecting rounding) items with the largest Fisher information at the
cut-point $\theta_0$, then randomly select $Nq_i$ items from among these,
resulting in a new item pool of $\sum_{i=1}^{s} Nq_i = N$ items, a proportion
$q_i$ of which are in category $i$. The chance that a given item in category
$i$ appears in the new pool is clearly no greater than
$Nq_i/(Nq_i/\pi) = \pi$. If a test that allows early stopping is being used,
like the TSPRT or modified Haybittle-Peto test, then the method of spiraling
(Kingsbury and Zara, 1989) can be used so that the category proportions are
close to $q_1, \ldots, q_s$ even when early stopping occurs; spiraling simply
entails choosing at the $(k + 1)$st stage an item from the category $i$ whose
proportion in the first $k$ items differs the most from $q_i$.
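
The pool-construction and spiraling steps described above can be sketched in
Python as follows. This is a minimal illustration, not the simulation code
behind Table 3: the item pool is synthetic, the names `build_pool` and
`next_category` are hypothetical, and "differs the most" is interpreted here
as the largest shortfall below the target proportion, which is an assumption.

```python
import random

def build_pool(items, q, N, pi, rng):
    """Select an exposure-controlled, content-balanced pool of N items.

    items: list of (item_id, category, fisher_info_at_cut_point) tuples
    q:     dict mapping category -> target proportion (summing to 1)
    """
    pool = []
    for cat, q_i in q.items():
        n_keep = round(N * q_i)        # N*q_i items kept from this category
        n_top = round(N * q_i / pi)    # N*q_i/pi most informative candidates
        ranked = sorted((it for it in items if it[1] == cat),
                        key=lambda it: it[2], reverse=True)
        pool.extend(rng.sample(ranked[:n_top], n_keep))
    return pool

def next_category(counts, q):
    """Spiraling: category whose share of items given so far is furthest below q_i."""
    k = sum(counts.values())

    def shortfall(cat):
        share = counts[cat] / k if k > 0 else 0.0
        return q[cat] - share

    return max(q, key=shortfall)

# Synthetic pool (hypothetical): 300 items, 100 per category, random information.
rng = random.Random(0)
items = [(j, j % 3, rng.random()) for j in range(300)]
q = {0: 0.4, 1: 0.3, 2: 0.3}
pool = build_pool(items, q, N=50, pi=0.25, rng=rng)

# Categories chosen by spiraling for the first 10 administered items.
counts = {cat: 0 for cat in q}
for _ in range(10):
    counts[next_category(counts, q)] += 1
```

In this sketch each of the `n_top` candidates in a category enters the pool
with probability `n_keep / n_top`, which equals $\pi$ when rounding is
neglected, matching the exposure bound stated above.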

INSERT TABLE 3 ABOUT HERE

Table 3 contains the average test length and power of the fixed-length
($N = 50$), TSPRT, modified TSPRT, and modified Haybittle-Peto tests using this
method of exposure control and content balancing, for various values of
$\theta$ (with $\theta_+$ in bold). For this study, the Chauncey item pool used
in Section 3.1 was randomly divided into $s = 3$ "content" categories, $\pi$
was set to .25, and $q_1 = .4$, $q_2 = .3$, $q_3 = .3$. Each entry in Table 3
was computed from 10,000 simulated tests. The fixed-length ($N = 50$) test uses
classification rule ^ with $C = 1.33$, chosen to achieve type I error
probability about $\alpha = .05$. The TSPRT uses the stopping rule @-(7) with
$\alpha = \beta = .05$, and the modified TSPRT (denoted by modTSPRT) uses the
same values of $A$ and $B$ but with $C = 1.3$ to ensure type I error
probability of $\alpha = .05$, as discussed in Section 3.1. The modified
Haybittle-Peto test (denoted by modHP) uses $A = 3.7$, $B = 3.8$, $C = 1.47$
that are chosen to satisfy (11)-(13) with $\alpha = \beta = .05$,
$\varepsilon = 1/2$, $\rho N = 5$, and $\theta^{(N)} = -2.11$, where the
fixed-length test has power $1 - \beta = .95$; these parameters were computed
using Monte Carlo simulation similar to the last paragraph of Section 3.1. The
tests show very similar relative performance to those in Table 2. The modTSPRT
and modHP tests have power functions very similar to the fixed-length test,
while the TSPRT is over-powered, including an inflated type I error probability
of 19.3% that results from use of the approximations (3), Q, (7) in its
definition, as discussed above. The modHP tests are substantially shorter than
the TSPRTs for all values of $\theta$ considered, and particularly for
$\theta > \theta_+$ where the reduction was around 40% to 50%. Note that the
tests in Table 3 are less powerful and on average longer than the corresponding
ones in Table 2; this is because, in order to satisfy the exposure control and
content balancing constraints, they do not always choose the most informative
item available.
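
The reductions in test length quoted in this section can be checked directly
from the average test lengths in Table 3. The short Python sketch below
recomputes the percentage reduction of modHP relative to the TSPRT at each
$\theta$; the dictionary layout and the name `reduction_pct` are illustrative
choices, and the numbers are simply those reported in Table 3.

```python
# Average test lengths (TSPRT, modHP) from Table 3, keyed by theta.
avg_length = {
    -0.50: (36.7, 17.6),
    -0.75: (42.1, 21.7),
    -1.00: (46.7, 28.0),
    -1.07: (47.5, 29.8),   # theta_+
    -1.25: (48.4, 34.3),
    -1.32: (48.3, 35.7),   # theta_0
    -1.50: (47.0, 38.1),
    -1.57: (46.1, 37.9),   # theta_-
    -1.75: (42.7, 36.4),
    -2.00: (35.9, 30.2),
    -2.11: (33.1, 27.3),   # theta^(N)
}

THETA_PLUS = -1.07

def reduction_pct(tsprt, modhp):
    """Percentage by which the modHP average length is shorter than the TSPRT's."""
    return 100.0 * (tsprt - modhp) / tsprt

reductions = {theta: reduction_pct(*lengths)
              for theta, lengths in avg_length.items()}
above_theta_plus = {t: r for t, r in reductions.items() if t > THETA_PLUS}

for theta, r in reductions.items():
    print(f"theta = {theta:5.2f}: modHP is {r:4.1f}% shorter than TSPRT")
```

The entries of `above_theta_plus` correspond to the range $\theta > \theta_+$
singled out in the text.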



4. Conclusion

This paper shows how efficient sequential tests that use "self-tuning" sequential 
GLR statistics can be extended from the i.i.d. setting to incorporate sequentially 
designed experiments. The tests are also sufficiently general to handle practical 
issues that arise in computerized adaptive testing applications, like the method 
used in Section 3.2 to satisfy the constraints on exposure control and content
balancing or the more complex methods proposed by Sympson and Hetter (1985)
and Stocking and Swanson (1993). These tests have potential applications in 
psychometric testing with sequentially generated experimental designs and data- 
dependent stopping rules, as illustrated in Sections 2.3 and 3 for CMT. 



Appendix: Proof of Theorems 2 and 3 and Related Asymptotic Theory

In order to prove Theorem 2, we modify the basic arguments of Lai and Shih 
(2004) that prove Theorem 1, and whose key ingredients are the following. 

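(a) Hoeffding's (1960) lower bound for the expected sample size $E_\theta T$ of
a test that has error probabilities $\alpha$ and $\beta$ at $\theta = \theta_+$
and $\theta_-$, which simplifies asymptotically to

\[ E_\theta(T) \ge (1 + o(1))\,|\log\alpha| \,/\, \max\{I(\theta, \theta_+),\, I(\theta, \theta^{(N)})\} \tag{A1} \]

as $\log\alpha \sim \log\beta$, where
$I(\theta, \lambda) = E_\theta\{\log[f_\theta(X_i)/f_\lambda(X_i)]\} = (\theta - \lambda)\psi'(\theta) - \{\psi(\theta) - \psi(\lambda)\}$
is the Kullback-Leibler information.

(b) The sample size $N$ of the fixed-sample-size likelihood ratio test of
$\theta = \theta_+$ versus $\theta = \theta_-$ with error probabilities $\alpha$
and $\beta$ at $\theta = \theta_+$ and $\theta_-$, which satisfies

\[ N \sim |\log\alpha| \,/\, I(\theta^*, \theta_+) \tag{A2} \]

as $\log\alpha \sim \log\beta$, where $\theta_- < \theta^* < \theta_+$ is the
unique solution of $I(\theta^*, \theta_+) = I(\theta^*, \theta^{(N)})$.
Moreover, $\max\{I(\theta, \theta_+),\, I(\theta, \theta^{(N)})\}$ attains its
minimum at $\theta = \theta^*$.

(c) $\lim_{n \to \infty} P_\theta\{\max_{\rho n \le m \le n} |\hat\theta_m - \theta| \ge a\} = 0$ for every $a > 0$.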
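
As a numerical illustration of (b), the following Python sketch solves
$I(\theta^*, \theta_+) = I(\theta^*, \theta^{(N)})$ by bisection and evaluates
the first-order approximation (A2). It assumes, purely for illustration, a
one-parameter Bernoulli exponential family with
$\psi(\theta) = \log(1 + e^\theta)$ rather than the IRT model and item pool of
Section 3; the function names are hypothetical, and the inputs
$\theta_+ = -1.07$, $\theta^{(N)} = -1.95$ are borrowed from Table 2 only as
convenient values. The resulting sample size is a first-order asymptotic
quantity and is not expected to match the test lengths in the simulations.

```python
import math

def psi(t):
    """Cumulant function of the Bernoulli family in its natural parameter (assumed)."""
    return math.log1p(math.exp(t))

def dpsi(t):
    """Derivative of psi: the success probability."""
    return 1.0 / (1.0 + math.exp(-t))

def kl(theta, lam):
    """I(theta, lam) = (theta - lam) psi'(theta) - {psi(theta) - psi(lam)}."""
    return (theta - lam) * dpsi(theta) - (psi(theta) - psi(lam))

def theta_star(theta_plus, theta_n, tol=1e-12):
    """Bisection for the root of I(t, theta_plus) - I(t, theta_n) on (theta_n, theta_plus)."""
    lo, hi = theta_n, theta_plus
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # The difference is positive at theta_n and decreases to a negative
        # value at theta_plus, so a positive value means the root is to the right.
        if kl(mid, theta_plus) - kl(mid, theta_n) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def approx_fixed_sample_size(alpha, theta_plus, theta_n):
    """First-order approximation (A2): |log alpha| / I(theta*, theta_plus)."""
    ts = theta_star(theta_plus, theta_n)
    return abs(math.log(alpha)) / kl(ts, theta_plus)

THETA_PLUS, THETA_N = -1.07, -1.95
ts = theta_star(THETA_PLUS, THETA_N)
n_approx = approx_fixed_sample_size(0.05, THETA_PLUS, THETA_N)
```

For this family the balancing condition reduces to
$\psi'(\theta^*) = \{\psi(\theta_+) - \psi(\theta^{(N)})\}/(\theta_+ - \theta^{(N)})$,
which gives an independent check on the bisection result.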

To extend this to Theorem 2, we need analogs of (a), (b), and (c) to hold for
the case of sequentially generated experiments. Without assuming the $X_i$ to
be independent, Lai (1981, Theorem 2) has derived a Hoeffding-type lower bound
which in our case takes the form

\[ E_\theta(T) \ge (1 + o(1))\,|\log\alpha| \,\Big/\, \max\Big\{ \sum_{j \in J} \nu_j I_j(\theta, \theta_+),\ \sum_{j \in J} \nu_j I_j(\theta, \theta^{(N)}) \Big\}, \tag{A3} \]

where $\nu_j = \lim_{n \to \infty} n^{-1} \sum_{i=1}^{n} 1\{j_i = j\}$ exists by
(26) and $I_j(\theta, \lambda)$ is given by (25). Lai's (1981, Theorem 2)
bounds are derived for sequential tests of $H_0 : P = P_0$ versus
$H_1 : P = P_1$ with type I and type II error probabilities $\alpha$ and
$\beta$, based on random variables $X_1, X_2, \ldots$ from a distribution $P$
such that $(X_1, \ldots, X_m)$ has joint density function
$p_m(x_1, \ldots, x_m)$, under the assumptions that for $k = 0, 1$,

\[ n^{-1} \log[p_n(X_1, \ldots, X_n)/p_{n,k}(X_1, \ldots, X_n)] \text{ converges in probability to } \eta_k, \tag{A4} \]
\[ \lim_{n \to \infty} P\Big\{ \max_{m \le n} \log[p_m(X_1, \ldots, X_m)/p_{m,k}(X_1, \ldots, X_m)] \ge (1 + \delta) n \eta_k \Big\} = 0 \text{ for every } \delta > 0, \tag{A5} \]

where $p_{n,k}$ denotes the joint density function under $H_k$, $k = 0, 1$.
These conditions hold in the present case, for which



[…] (A6)

by (21), and $\eta_0 = \sum_{j \in J} \nu_j I_j(\theta, \theta_+)$ and
$\eta_1 = \sum_{j \in J} \nu_j I_j(\theta, \theta^{(N)})$, which can be shown by
the following argument. We shall use $P$ to denote the probability measure
under which the true parameter value is $\theta$ and $\xrightarrow{P}$ to
denote convergence in probability under this measure. The notation $o_P(1)$
will be used to denote a random variable $Y_n$ such that
$Y_n \xrightarrow{P} 0$. From (26) it follows that

[…]

and combining this with (24) and the law of large numbers applied to (A6)
yields

\[ n^{-1} \log(L_n(\theta)/L_n(\theta_+)) = n^{-1} \sum_{i=1}^{n} \big\{ \psi'(\tau_{j_i}(\theta)) [\tau_{j_i}(\theta) - \tau_{j_i}(\theta_+)] - [\psi(\tau_{j_i}(\theta)) - \psi(\tau_{j_i}(\theta_+))] \big\} + o_P(1) \]
\[ = \sum_{j \in J} \nu_j I_j(\theta, \theta_+) + o_P(1) \xrightarrow{P} \eta_0, \]

where we have used (25) for the second equality. The same argument can be
used to show that

\[ n^{-1} \log\{L_n(\theta)/L_n(\theta^{(N)})\} - \eta_1 \xrightarrow{P} 0. \]



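By Taylor's expansion of (25),

\[ I_j(\theta, \lambda) = (\theta - \lambda)^2\, \psi''(\tau_j(\lambda))\, [\tau_j'(\lambda)]^2 / 2 + o((\theta - \lambda)^2) \quad \text{for fixed } \lambda, \tag{A7} \]
\[ I_j(\theta, \lambda) = (\lambda - \theta)^2\, \psi''(\tau_j(\theta))\, [\tau_j'(\theta)]^2 / 2 + o((\lambda - \theta)^2) \quad \text{for fixed } \theta. \tag{A8} \]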



Hence the assumption (27) guarantees uniform convexity of the information
numbers, which can be used in conjunction with (26) to show that (c) still
holds in the setting of Theorem 2. Moreover, modification of the proof of
Theorem 3 and equation (15) in Lai and Shih (2004) can be used to show
that as $\log\alpha \sim \log\beta$,
$N \sim |\log\alpha| / \sum_{j \in J} \nu_j I_j(\theta^*, \theta_+)$ analogous
to (A2), where
$\sum_{j \in J} \nu_j I_j(\theta^*, \theta_+) = \sum_{j \in J} \nu_j I_j(\theta^*, \theta^{(N)})$,
and that

\[ E_\theta M \sim |\log\alpha| \,\Big/\, \max\Big\{ \sum_{j \in J} \nu_j I_j(\theta, \theta_+),\ \sum_{j \in J} \nu_j I_j(\theta, \theta^{(N)}) \Big\} \sim \inf_{T \in \mathcal{T}_{\alpha,\beta,N}} E_\theta T, \tag{A9} \]

proving Theorem 2.

Theorem 3 is proved analogously by replacing the "items" in Theorem 2
by the "item classes" $J_k$ satisfying (29). The conditions (30)-(31) provide
the analogs to (26)-(27).



In the sequel we let $\theta_0$ denote the true parameter value to study the
asymptotic properties of the MLE $\hat\theta_m$ and the GLR statistics in
sequentially designed experiments that satisfy (26) and (27). Note that (c)
ensures that with probability approaching 1, $\hat\theta_m$ is near $\theta_0$
for all $\rho n \le m \le n$. A standard argument involving martingale central
limit theorems (Durrett, 2005, p. 411) and Taylor's expansion of
$\log L_m(\theta)$ around $\theta_0$ can be used to show that as
$n \to \infty$

[…] $(\hat\theta_n - \theta_0)$ has a limiting standard normal distribution, (A10)

and that the signed-root likelihood ratio statistics in (14), with $\theta$
replaced by $\theta_0$, are asymptotically normal with independent increments,
generalizing
from the i.i.d. case to sequentially generated experiments.

Table 1: Type I error probability $P_{\theta_+}\{\text{reject } H_0\}$ and
average test length $E_{\theta_+} T$ of TSPRT using Chauncey item pool with
maximum test length $N = 50$ and thresholds
$A = B = \log((1 - \alpha)/\alpha)$, $C = 0$.

$\alpha$                                 .001    .005    .010    .050    .100    .200
$P_{\theta_+}\{\text{reject } H_0\}$     .165    .163    .163    .161    .165    .193
$E_{\theta_+} T$                         50.0    50.0    49.7    44.2    36.7    22.4
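
The thresholds named in the caption of Table 1 follow directly from $\alpha$.
As a small worked example, the Python sketch below evaluates
$A = B = \log((1 - \alpha)/\alpha)$ for the six values of $\alpha$ in the table
and pairs each with the simulated type I error probability reported there; the
variable and function names are illustrative.

```python
import math

# Nominal alpha -> simulated type I error probability, as reported in Table 1.
table1 = {0.001: 0.165, 0.005: 0.163, 0.010: 0.163,
          0.050: 0.161, 0.100: 0.165, 0.200: 0.193}

def tsprt_threshold(alpha):
    """Threshold A = B = log((1 - alpha) / alpha) from the caption of Table 1."""
    return math.log((1.0 - alpha) / alpha)

thresholds = {alpha: tsprt_threshold(alpha) for alpha in table1}

for alpha, observed in table1.items():
    print(f"alpha = {alpha:.3f}  A = B = {thresholds[alpha]:.3f}  "
          f"simulated type I error = {observed:.3f}")
```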



Table 2: Average test length and power (in parentheses) of the fixed-length,
TSPRT, modified TSPRT (modTSPRT), and modified Haybittle-Peto (modHP)
tests using the Chauncey item pool.

$\theta$                  fixed           TSPRT           modTSPRT        modHP
-0.50                     50.0 (0.04%)    27.6 (0.23%)    27.6 (0.04%)    12.6 (0.13%)
-0.75                     50.0 (0.28%)    34.7 (1.53%)    34.7 (0.24%)    15.8 (0.46%)
-1.00                     50.0 (2.70%)    42.6 (10.5%)    42.6 (2.57%)    22.1 (3.27%)
$\theta_+ = -1.07$        50.0 (5.00%)    44.2 (16.1%)    44.2 (5.00%)    24.5 (5.00%)
-1.25                     50.0 (17.5%)    46.5 (39.0%)    46.5 (17.1%)    30.5 (17.1%)
$\theta_0 = -1.32$        50.0 (25.5%)    46.6 (49.2%)    46.6 (24.0%)    32.4 (25.1%)
-1.50                     50.0 (51.7%)    44.2 (75.6%)    44.2 (49.7%)    35.0 (49.1%)
$\theta_- = -1.57$        50.0 (62.9%)    42.3 (83.3%)    42.3 (60.3%)    34.7 (59.4%)
-1.75                     50.0 (83.3%)    36.3 (94.8%)    36.3 (82.6%)    30.5 (80.4%)
$\theta^{(N)} = -1.95$    50.0 (95.0%)    29.3 (99.0%)    29.3 (93.5%)    23.6 (92.2%)
-2.00                     50.0 (95.4%)    27.8 (99.3%)    27.8 (94.3%)    22.1 (93.2%)

Table 3: Average test length and power (in parentheses) of the fixed-length,
TSPRT, modified TSPRT (modTSPRT), and modified Haybittle-Peto (modHP)
tests with exposure control and content balancing.

$\theta$                  fixed           TSPRT           modTSPRT        modHP
-0.50                     50.0 (0.03%)    36.7 (0.45%)    36.7 (0.00%)    17.6 (0.14%)
-0.75                     50.0 (0.43%)    42.1 (2.71%)    42.1 (0.31%)    21.7 (0.64%)
-1.00                     50.0 (3.04%)    46.7 (14.1%)    46.7 (3.27%)    28.0 (3.12%)
$\theta_+ = -1.07$        50.0 (5.00%)    47.5 (19.2%)    47.5 (5.00%)    29.8 (5.00%)
-1.25                     50.0 (14.7%)    48.4 (39.8%)    48.4 (15.4%)    34.3 (13.9%)
$\theta_0 = -1.32$        50.0 (21.1%)    48.3 (49.1%)    48.3 (21.6%)    35.7 (20.3%)
-1.50                     50.0 (40.9%)    47.0 (71.8%)    47.0 (42.9%)    38.1 (38.0%)
$\theta_- = -1.57$        50.0 (50.1%)    46.1 (78.9%)    46.1 (51.4%)    37.9 (47.5%)
-1.75                     50.0 (72.0%)    42.7 (91.1%)    42.7 (73.0%)    36.4 (67.3%)
-2.00                     50.0 (91.2%)    35.9 (98.3%)    35.9 (91.7%)    30.2 (86.2%)
$\theta^{(N)} = -2.11$    50.0 (95.0%)    33.1 (99.2%)    33.1 (95.3%)    27.3 (91.2%)